1 Setup

Today we’ll need the following R libraries, if you get an error, make sure you have all of them installed with install.packages() function.

library(tidyverse)
library(RColorBrewer) # to create color scales
library(ggrepel)      # to crate nice labels for gene names
library(scales)       # to look at color we created

2 Creating a customized volcano plot

Last lecture we looked at point plots and how to plot them based on data.
Today we will look at a customized volcano plot like the one below:

What is a volcano plot?
A volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively. These plots are increasingly common in omic experiments such as genomics, proteomics, and metabolomics where one often has a list of many thousands of replicate data points (e.g. genes, proteins, metabolites) between two conditions (e.g. normal vs sick) and one wishes to quickly identify the most meaningful changes. In this example we’re looking gene expression (process that converts sequence of a gene in the DNA to messenger RNA).

2.1 Dataset for the volcano plot

Dataset we’ll use comes from an experiment where mice were exposed to x-ray (this is also used as a treatment to kill cancer cells) . Then messenger RNA (mRNA) levels were measured from ~27,000 genes in spleen tissue in both control and exposed mice. The table below shows the results from a statistical test comparing these two groups across ~27,000 genes. Conducting the same comparison test with 27,00 genes leads to (multiple comparison (testing) problem)[https://en.wikipedia.org/wiki/Multiple_comparisons_problem]. To overcome this problem and detect the true positives, multiple testing correction was applied to the results, which gives the false discovery rate (FDR).

genes.Spleen.Xray <- read_csv("data/results_miceSpleen_xray.csv")
head(genes.Spleen.Xray)
GENE logFC PValue FDR
Fam212b 4.554160 0 0
Mdm2 1.638605 0 0
Ccng1 3.060737 0 0
Polk 2.582634 0 0
Aen 2.606176 0 0
Slc19a2 2.468342 0 0

This data format is already in appropriate format for our plotting purposes.

We will only add columns:
- logFDR: Negative \(log_{10}\) transformed FDR (false discovery rate) values
- threshold: defining whether a gene qualifies for the set threshold. In this case, it is any gene that satisfies:
- \(FDR < 0.1\) (which is also same as \(logFDR > 1\)) and \(|\ logFC\ | > 0.4\)

We will also pick the top 10 genes and assign as a separate data frame for labels in the plot.

pdat <- genes.Spleen.Xray %>%
  mutate(logFDR = -log10(FDR),
         threshold = ifelse(FDR < 0.1 & abs(logFC) > .4, TRUE, FALSE) )

#subset top 10 genes to add labels later to our plot
data.label <- pdat %>%
  filter(threshold == TRUE) %>%
  top_n(n = 10, wt = logFDR)
data.label
GENE logFC PValue FDR logFDR threshold
Fam212b 4.554160 0 0 10.962238 TRUE
Mdm2 1.638605 0 0 9.834850 TRUE
Ccng1 3.060737 0 0 9.834850 TRUE
Polk 2.582634 0 0 9.834850 TRUE
Aen 2.606176 0 0 9.741446 TRUE
Slc19a2 2.468342 0 0 9.741446 TRUE
Zfp365 5.288498 0 0 9.129395 TRUE
Prrg4 4.691077 0 0 8.636292 TRUE
Phlda3 4.100608 0 0 8.497192 TRUE
Zmat3 2.867346 0 0 8.402769 TRUE
Bax 1.991250 0 0 8.402769 TRUE

2.2 geom_point layers

In the plot any point below our threshold has different aesthetics mapped to them!! Therefore, we will add two layers of geom_point and provide different datasets and aesthetic mappings to each layer.

One of the differences between these two layers is the shape of the point.

Plotting ‘character’ (pch) There are multiple shapes available for plotting points. Take a look with:

?pch

Back to our plot…

vp <- ggplot(pdat, aes(x = logFC, y = logFDR)) +
  geom_point(data = filter(pdat, threshold == FALSE),
             size=.35, alpha = .55, colour = "grey80") +
  geom_point(data = filter(pdat, threshold == TRUE),
             aes(size = logFDR, fill=logFDR),
             pch = 21, stroke = .25, alpha = .65, color = "grey59")
vp

2.3 Let’s talk about colors

?brewer_pal
brewer.pal(6, "PuRd")
## [1] "#F1EEF6" "#D4B9DA" "#C994C7" "#DF65B0" "#DD1C77" "#980043"
display.brewer.pal(6, "PuRd")

myPallete <- colorRampPalette(brewer.pal(8,"PuRd"))
myPallete(10)
##  [1] "#F7F4F9" "#EAE5F1" "#DCCAE3" "#D0ACD3" "#CB8EC4" "#DC6AB2" "#E43C96"
##  [8] "#DB1E72" "#C00E50" "#91003F"
show_col(myPallete(10))

show_col(myPallete(100), labels = FALSE)

2.4 Customize color scales, legend & axes labels and plot title

vp <- vp +
  scale_fill_gradientn(guide="colourbar", colours = myPallete(100)) +
  guides(size=FALSE,
         color=FALSE,
         fill=guide_colourbar(title = expression(-log[10](FDR)))) +
  labs(x = "Fold Change",
       y = expression(-log[10](FDR))) +
  geom_hline(yintercept=1, color="black", size=1) +
  annotate(geom = "text", x=-6, y=0.75, label="italic(FDR) == 0.1",parse=TRUE) +
    ggtitle("A Volcano Plot!")

vp

2.5 Add labels for top 10 genes

vp <- vp  +
  geom_label_repel(data = data.label,
                   aes(label = GENE), alpha = .95,
                   size = 4, color = "black",
                   box.padding = 0.5, segment.size = .5,
                   arrow = arrow(length = unit(0.015, 'npc')))


vp

2.6 Make cosmetic changes with theme()

vp <- vp  +
  theme_minimal() +
  theme(plot.title = element_text(lineheight = 1,face = "bold",size = 20),
           axis.text = element_text(size = 12),
           axis.title = element_text(size = 16, face = "bold"),
           legend.text=element_text(size=13),
           legend.position="bottom",
           legend.title = element_text(size = 12,face = "bold"),
           legend.key.size = unit(0.035, "npc"),
           strip.text.x = element_text(size = 16, colour = "black"),
           strip.text.y = element_text(size = 16, colour = "black"))

vp

2.7 Save plot

Default is PDF format but many graphic devices are available.

?ggsave

ggsave( "volcano_plot.pdf",
        vp, width = 10, height = 8)

ggsave( "volcano_plot.jpeg",
        vp, width = 10, height = 8, device = "jpeg")

3 Other plot types

3.1 mpg data

ggplot2 dataset, you need to first load tidyverse or ggplot2

Fuel economy data from 1999 and 2008 for 38 popular models of car

?mpg
head(mpg,3)
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
mpg$class[1:20]
##  [1] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
##  [8] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
## [15] "compact" "midsize" "midsize" "midsize" "suv"     "suv"

3.2 Barplots

geom_bar is designed to make it easy to create bar charts that show counts. For example number of cars in each class

ggplot(mpg, aes(manufacturer)) +
  geom_bar() +
    theme(axis.text.x = element_text(size = 10, angle = 45))

3.2.1 Stacked

To plot sub categories in each group, use aes(fill = x) function. This will let you color the bars. Bar charts are automatically stacked when multiple bars are “defined”" at the same location.

ggplot(mpg, aes(manufacturer)) +
 geom_bar(aes(fill = class)) +
  theme(axis.text.x = element_text(size = 10, angle = 45))

3.2.2 Side by side

To have side by side plots for the groups you would like to compare, you need to use the position argument within the geom_ functions. You can either simply assign the name of positioning as a string or assign a position_ function for more details.

ggplot(mpg, aes(manufacturer)) +
 geom_bar(aes(fill = class), position = "dodge")

ggplot(mpg, aes(manufacturer)) +
  geom_bar(aes(fill = class),position = position_dodge(width = 0.4))

3.3 Visualizing data points or mean as barplot

Plotting mean or other statistics of a variable directly with ggplot2 is less straightforward. One way is to use stat_summary_bin function.

?stat_summary_bin

ggplot(mpg, aes(class)) +
  stat_summary(aes(y = displ), fun.y = "mean", geom = "bar")

A more straightforward approach is to use geom_col.

3.3.1 geom_col function

If you would like to plot means or any other values other than the count of categories you need to provide them to ggplot() in the following format:

Assuming that outcome is the mean outcome of treatments (trt) a, b and c here.

df <- data.frame(trt = c("a", "b", "c"), outcome = c(2.3, 1.9, 3.2))
df
trt outcome
a 2.3
b 1.9
c 3.2

Then use geom_col()

ggplot(df, aes(trt, outcome)) +
  geom_col()

You can also change colors easily

ggplot(df, aes(trt, outcome)) +
  geom_col(fill="steelblue")

ggplot(df, aes(trt, outcome)) +
  geom_col(aes(fill=trt))

Another nice trick is to take advantage of group_by and summarise functions.

mpg %>%
  group_by(class) %>%
  summarise(cty_mean = mean(cty)) %>%
  ungroup() %>%
  ggplot(aes(class, cty_mean)) +
  geom_col()

3.4 geom_bar() with continuous data

You can use geom_bar() with continuous data, in which case it will values and show counts at unique locations. (But geom_histogram is more appropriate for this task.)

ggplot(mpg, aes(displ)) + geom_bar()

4 Histograms

Histograms are basically barplots with “bins”. To construct a histogram,

  1. “bin” the range of values. What is meant by “binning” is, to divide the entire range of values into a series of intervals.
  2. Then count how many values fall into each interval.

The bins are usually specified as consecutive, non-overlapping intervals of a variable.

ggplot(iris, aes(Sepal.Length)) + geom_histogram(binwidth = 0.2)

ggplot(iris, aes(Sepal.Length, fill=Species)) +
  geom_histogram(binwidth = 0.5)

ggplot(iris, aes(Sepal.Length, fill=Species)) +
  geom_histogram(binwidth = 0.5, alpha=0.4)

ggplot(iris, aes(Sepal.Length, fill=Species)) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha=0.4)

5 Boxplots

A box plot is a easy way to get some overall idea about the distribution of your variable. Let’s have a quick review on how to read a box plot:

Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ2) Population

Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ2) Population

5.1 Example ToothGrowth dataset

head(ToothGrowth)
len supp dose
4.2 VC 0.5
11.5 VC 0.5
7.3 VC 0.5
5.8 VC 0.5
6.4 VC 0.5
10.0 VC 0.5
?ToothGrowth

Here’s how the the distribution of tooth length for different regimens look like.

gb <- ToothGrowth %>%
      unite(regimen, -len, sep= "_") %>%
      ggplot(data=., aes(regimen, len)) +
      geom_boxplot(aes(color=regimen))

6 Plotting means and error bars

The error bars can define multiple types of variation:

  1. Standard deviation (sd) shows how much each data point deviate from the mean “on average”. This is useful if you would like to have a general idea about how the data is distributed.

  2. Standard error of the mean (sem) gives a value explaining how much your estimated mean is affected by the sampling variability.

  3. 95% Confidence interval (ci95) the interval which you are 95% certain that the true mean (or any estimated value/statistic) falls into. This is probably the best type of variation if we are really interested in estimating the mean.

As mentioned earlier, ggplot2 it’s easier to provide calculated summary statistics for plotting, rather than using the stats_ functions.

So, now let’s compute the mean, sd and se using summarise function.

6.1 Compute stats for ToothGrowth

tg.stats <- ToothGrowth %>%
  group_by(supp, dose) %>% # group by
  summarise(.,
            mean.len = mean(len),  # mean
            sd.len = sd(len),      # sd
            n = length(len)        # count
            ) %>% 
  ungroup() %>%
  mutate(se.len = sd.len/sqrt(n))

Now this is how tg.stats look

tg.stats
supp dose mean.len sd.len n se.len
OJ 0.5 13.23 4.459708 10 1.4102837
OJ 1.0 22.70 3.910953 10 1.2367520
OJ 2.0 26.06 2.655058 10 0.8396031
VC 0.5 7.98 2.746634 10 0.8685620
VC 1.0 16.77 2.515309 10 0.7954104
VC 2.0 26.14 4.797731 10 1.5171757

Finally we can start plotting! Let’s start with line graphs.

6.2 Point and Line graphs

Step by step:

6.2.1 1. Just plot mean tooth length as data points

For points, shape is the same as pch.

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
    geom_point(aes(x=dose, y= mean.len), size=4, shape=18)

6.2.2 2. Add the error bars: Show standard deviation

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1)

6.2.3 3. Add lines

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
  geom_line()

6.2.4 4. Add original data points

Just specify the original data set in another layer of geom_plot. Just like the volcano plot!

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
  geom_line() +
  # here I add the original data points!
  geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.4, size=1.5)

Now that we have more or less the plot we want, let’s customize it further.

7 Customizing graphs

7.1 1. Change titles and set scale limits

Modify axis, legend, and plot labels see Reference ggplot2: Scale

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
  geom_line() +
  geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +

# Here I add the titles I want
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R")

Define the limits of axes

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
  geom_line() +
  geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +

# Here I add the titles I want
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R") +

# Change x and y axes limits
  xlim(0, 3) +
  ylim(0, 60)

7.2 2. Change legend title

We saw in the previous session how to change legend parameters with guides() function. To change the legend title we can use the same function. See (Reference ggplot2 Guides)[http://ggplot2.tidyverse.org/reference/index.html#section-guides-axes-and-legends]

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1) +
  geom_line() +
  geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +

# Here I add the titles I want
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R") +

# Change the legend
  guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))

7.3 3. Jitter points

I want to customize the jitter of the points. To adjust positions look at position adjustment in ggplot2 Reference

pd <- position_dodge(0.1)

# my plot is getting very long, I'll assign it
tgp <- ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18, position=pd) +
  geom_errorbar(aes(ymin=mean.len-sd.len, ymax=mean.len+sd.len), width=.1, position=pd) +
  geom_line(position=pd) +
  geom_point(data=ToothGrowth, aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5, position=pd) +


# Here I add the titles I want
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R") +

# Change the legend
  guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))

tgp

8 Bonus: Plotting Time

The dataset we’ll use contains the daily closing prices of major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. Be aware that this is a slightly modified version of the EuStockMarkets a dataset that comes with R and the dates in the data was simulated by me and doesn’t represent the actual dates. But it already comes in a friendly date format, so when we import it, R will already now that this columns is in date format.

euStock <- read_csv("data/EUstock_1992-1997.csv")
head(euStock)
DAX SMI CAC FTSE date
1577.26 1670.1 1765.7 2493.1 1992-01-01
1577.26 1670.1 1765.7 2493.1 1992-01-02
1598.19 1670.1 1749.9 2492.8 1992-01-03
1604.05 1704.0 1770.3 2504.1 1992-01-06
1604.69 1711.8 1787.6 2493.2 1992-01-07
1593.65 1700.5 1778.7 2482.9 1992-01-08

As we already know, this dataset can be cleaner. We would rather have all the stock indices listed under one column for plotting.

euStock <- euStock %>%
  gather(Stock, Closing, -date) 
head(euStock)
date Stock Closing
1992-01-01 DAX 1577.26
1992-01-02 DAX 1577.26
1992-01-03 DAX 1598.19
1992-01-06 DAX 1604.05
1992-01-07 DAX 1604.69
1992-01-08 DAX 1593.65

This looks much better!

Now let’s plot it!

p <- euStock %>%
ggplot(aes(x=date, y=Closing, color = Stock)) +
  geom_line()
p

8.1 Customize Time Format

Now we can decide which date format we want to use for plotting. The table below explains how to provide the date format. Click here for more details.

Here some examples:

p+scale_x_date(date_labels = "%b")

p+scale_x_date(date_labels = "%Y %b %d")

p+scale_x_date(date_labels = "%m-%Y")

8.2 Customize Time Scale

To see the x-axis, let’s angle out x-axis tick labels so we can see them clearly.

p<- p + theme(axis.text.x=element_text(angle=60, hjust=1, size=8))

p + scale_x_date(date_breaks = "1 month", date_labels = "%b")

p + scale_x_date(date_breaks = "3 month", date_labels = "%b %y")

p + scale_x_date(date_breaks = "3 month", date_labels = "%b %y",
                 date_minor_breaks = "2 week") 

8.3 Limit Time Scale

You can also plot a specific period by giving a range with first and last date of the period.

p + scale_x_date(limit=c(as.Date("1994-01-01"),as.Date("1996-01-31")))